knitr::opts_chunk$set(echo = TRUE)
require(dplyr)
require(ggplot2)
require(kableExtra)
require(reactable)
require(reactablefmtr)
require(CSGo)
require(readxl)
require(caret)
require(corrplot)

library(reshape2)

Introduction

The goal of this project is to demonstrate some concepts that we learned in the “Regression Models” course applied to real data. For that purpose we are going to use a data set from House Prices in King County provided by the professor. We are going to explore this data set ether descriptively, and also testing some Regression Models to best predict the predict the house prices in King County.

About the data

This data set includes 21613 rows one for each combination of id and date (there are 175 ids repited twice and 1 id repited three times) and 21 columns. The data is from 2014-05-02 to 2015-05-14.

  • id: Identification number of the property (this variable is irrelevant for the analysis)

  • date: Date house was sold

  • price: Price (the prediction target)

  • bedrooms: Number of Bedrooms/House

  • bathrooms: Number of Bathrooms/Bedrooms

  • sqft living: square footage of the home

  • sqft lot: square footage of the lot

  • floors: Total floors (levels) in house

  • waterfront: House which has a view to a waterfront

  • view: Has been viewed

  • condition: How good the condition is (larger values mean better condition)

  • grade: overall grade given to the housing unit, based on King County grading system (larger grades are better)

  • sqft above: square footage of house apart from basement

  • sqft basement: square footage of the basement

  • yr built: Built Year

  • yr renovated: Year when house was renovated

  • zipcode: zip code

  • lat: Latitude coordinate

  • long: Longitude coordinate

  • sqft living15: Living room area in 2015 (implies? some renovations) This might or might not have affected the lotsize area

  • sqft lot15: lotSize area in 2015 (implies? some renovations)

Below it is possible to see a sample of the data:

Data cleaning

As we don’t want to have temporal dependencies in our dataset, the very first step in order to clean it would be to remove the duplicates id overtime. So, for the repetitive ids we are going to consider only the most actual price. The code below presents this step.

The second step we are going to perform will be to divide our data set in train and test. The train data set will be used to the descriptive and also to train the regression models, it will contains 80% of our data. The test data set will be the remaining 20% data that we will use only to evaluate how our model will perform with unknown data.

require(caret)

# create Data Partition
set.seed(1992)
intrain <- createDataPartition(data$price, p = 0.8, list = FALSE)

db_train_raw <- data[intrain,]
db_test_raw <- data[-intrain,]

Basic Descriptive

In order to understand and summarize our target variable data set we are going to perform some descriptive analysis in our train data set as follows.

library(GGally)

roundUp <- function(x) 10^ceiling(log10(x))

db_train = db_train_raw %>% 
  mutate(
    bathrooms = as.numeric(bathrooms)
    , floors = as.numeric(floors)
    , waterfront = as.factor(as.numeric(as.factor(waterfront)) - 1)
    , lat = as.numeric(lat)
    , long = as.numeric(long)
  ) %>% 
  mutate(
    lat = lat / (roundUp(abs(lat / 47))/10)
    , long = long / (roundUp(abs(long / 120))/10)
  ) %>% 
  dplyr::select(-id, -date) %>% 
  data.frame()

Continuous Variables

Below it is presented some descriptive analysis for the continuous variables.

price

The average price is 5.4196302^{5} USD, the median is equal to 4.5^{5} years old, the oldest player is 7.7^{6} USD, the youngest player is 7.5^{4} USD, and the standard deviation is 3.6790934^{5} USD.

Below it is represented the histogram and the density of the variable. It is possible to see that the distribution of price is asymmetric with a tail in the right.

require(ggplot2)
require(CSGo)

avg <- round(mean(db_train$price),2)
median <- round(median(db_train$price),2)

# histogram and density
db_train %>%
  ggplot(aes(x = price)) +
  geom_histogram(
    aes(y = ..density..),
    bins = 30, 
    fill = "#509997", 
    color = "black") +
  geom_density(color = "#761954", size = 1) +
  geom_vline(
    xintercept = avg, 
    linetype = "dashed", 
    color = "#c62d64",
    size = 1.5) +
  geom_text(aes(avg, 0, label = paste("Average:", avg), hjust = -0.2, vjust = -1), color = "#c62d64") +
  geom_vline(
    xintercept = median, 
    linetype = "dotted", 
    color = "#87ca9d",
    size = 1.5) +
  geom_text(aes(median, 0, label = paste("Median:", median), hjust = 1.2, vjust = -1), color = "#87ca9d") +
  ggtitle("Distribution of player's age") +
  CSGo::theme_csgo() +
  ylab("Price (USD)") +
  xlab("Density")

Lets also create a normal qqplot to see if the prices fits a normal distribution.

The normal qqplot compares the theoretical quantile from a normal distribution with the quantile from our sample data, in our case the price variable. If the price variable follows a normal distribution we expect to see most of the points above the reference line (Theoretical = Sample). However, based on the normal qqplot, we can assume that the price variable not follows a normal distribution at all.

sqft_living

The average sqft_living is 2083.5 USD, the median is equal to 1914 years old, the oldest player is 1.354^{4} USD, the youngest player is 370 USD, and the standard deviation is 919.96 USD.

Below it is represented the histogram and the density of the variable. It is possible to see that the distribution of sqft_living is kind of symmetric but with some atypical values to the right.

avg <- round(mean(db_train$sqft_living),2)
median <- round(median(db_train$sqft_living),2)

# histogram and density
db_train %>%
  ggplot(aes(x = sqft_living)) +
  geom_histogram(
    aes(y = ..density..),
    bins = 30, 
    fill = "#509997", 
    color = "black") +
  geom_density(color = "#761954", size = 1) +
  geom_vline(
    xintercept = avg, 
    linetype = "dashed", 
    color = "#c62d64",
    size = 1.5) +
  geom_text(aes(avg, 0, label = paste("Average:", avg), hjust = -0.2, vjust = -1), color = "#c62d64") +
  geom_vline(
    xintercept = median, 
    linetype = "dotted", 
    color = "#87ca9d",
    size = 1.5) +
  geom_text(aes(median, 0, label = paste("Median:", median), hjust = 1.2, vjust = -1), color = "#87ca9d") +
  ggtitle("Distribution of player's age") +
  CSGo::theme_csgo() +
  ylab("Sqft Living") +
  xlab("Density")

sqft_lot

The average sqft_lot is 1.480518^{4} USD, the median is equal to 7620 years old, the oldest player is 1.164794^{6} USD, the youngest player is 520 USD, and the standard deviation is 3.830935^{4} USD.

Below it is represented the histogram and the density of the variable. It is possible to see that the distribution of sqft_lot is asymmetric with some atypical values to the right.

avg <- round(mean(db_train$sqft_lot),2)
median <- round(median(db_train$sqft_lot),2)

# histogram and density
db_train %>%
  ggplot(aes(x = sqft_lot)) +
  geom_histogram(
    aes(y = ..density..),
    bins = 30, 
    fill = "#509997", 
    color = "black") +
  geom_density(color = "#761954", size = 1) +
  geom_vline(
    xintercept = avg, 
    linetype = "dashed", 
    color = "#c62d64",
    size = 1.5) +
  geom_text(aes(avg, 0, label = paste("Average:", avg), hjust = -0.2, vjust = -1), color = "#c62d64") +
  geom_vline(
    xintercept = median, 
    linetype = "dotted", 
    color = "#87ca9d",
    size = 1.5) +
  geom_text(aes(median, 0, label = paste("Median:", median), hjust = 1.2, vjust = -1), color = "#87ca9d") +
  ggtitle("Distribution of player's age") +
  CSGo::theme_csgo() +
  ylab("Sqft Lotg") +
  xlab("Density")

sqft_above

The average sqft_above is 1791.08 USD, the median is equal to 1560 years old, the oldest player is 9410 USD, the youngest player is 370 USD, and the standard deviation is 828.65 USD.

Below it is represented the histogram and the density of the variable. It is possible to see that the distribution of sqft_above is kind of symmetric but with some atypical values to the right.

avg <- round(mean(db_train$sqft_above),2)
median <- round(median(db_train$sqft_above),2)

# histogram and density
db_train %>%
  ggplot(aes(x = sqft_above)) +
  geom_histogram(
    aes(y = ..density..),
    bins = 30, 
    fill = "#509997", 
    color = "black") +
  geom_density(color = "#761954", size = 1) +
  geom_vline(
    xintercept = avg, 
    linetype = "dashed", 
    color = "#c62d64",
    size = 1.5) +
  geom_text(aes(avg, 0, label = paste("Average:", avg), hjust = -0.2, vjust = -1), color = "#c62d64") +
  geom_vline(
    xintercept = median, 
    linetype = "dotted", 
    color = "#87ca9d",
    size = 1.5) +
  geom_text(aes(median, 0, label = paste("Median:", median), hjust = 1.2, vjust = -1), color = "#87ca9d") +
  ggtitle("Distribution of player's age") +
  CSGo::theme_csgo() +
  ylab("Sqft Above") +
  xlab("Density")

sqft_basement

The average sqft_basement is 292.41 USD, the median is equal to 0 years old, the oldest player is 4820 USD, the youngest player is 0 USD, and the standard deviation is 444.16 USD.

Below it is represented the histogram and the density of the variable. It is possible to see that the distribution of sqft_basement has a zero inflated, maybe it could be used as categorical variable indicating if the house has a basement or not, since we have in pour train dataset that 60.77% of the houses does not have a basement and 60.77% have.

avg <- round(mean(db_train$sqft_basement),2)
median <- round(median(db_train$sqft_basement),2)

# histogram and density
db_train %>%
  ggplot(aes(x = sqft_basement)) +
  geom_histogram(
    aes(y = ..density..),
    bins = 30, 
    fill = "#509997", 
    color = "black") +
  geom_density(color = "#761954", size = 1) +
  geom_vline(
    xintercept = avg, 
    linetype = "dashed", 
    color = "#c62d64",
    size = 1.5) +
  geom_text(aes(avg, 0, label = paste("Average:", avg), hjust = -0.2, vjust = -1), color = "#c62d64") +
  geom_vline(
    xintercept = median, 
    linetype = "dotted", 
    color = "#87ca9d",
    size = 1.5) +
  geom_text(aes(median, 0, label = paste("Median:", median), hjust = 1.2, vjust = -1), color = "#87ca9d") +
  ggtitle("Distribution of player's age") +
  CSGo::theme_csgo() +
  ylab("Sqft Basement") +
  xlab("Density")

Categorical Variables

Below it is presented some descriptive analysis for the categorical variables against our target variable price.

bedrooms

First let’s take a look in the frequency table bellow.

# freq table
table(db_train$bedrooms)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   33 
##   12  153 2183 7794 5473 1275  209   33    8    6    3    1    1

Based on the table above we have decided to joing the more than 8 in the same category.

Below is presented the boxplot of price by bedrooms.

db_train_linear <- db_train %>%
  mutate(
    bedrooms_cat = ifelse(bedrooms >=8, "more than 8", as.factor(bedrooms))
  )

# bar plot
db_train_linear %>%
  ggplot(aes(x = bedrooms_cat, y = price, fill = bedrooms_cat)) +
  geom_boxplot(color = "black") +
  ggtitle("Price by Number of bedrooms") +
  ylab("Price") +
  xlab("Number of bedrooms") +
  labs(fill = "Number of bedrooms") +
  CSGo::scale_fill_csgo()+
  CSGo::theme_csgo()

The boxplot shows that the bigger the number of bedrooms bigger the price.

bathrooms

First let’s take a look in the frequency table bellow.

# freq table
table(db_train$bathrooms)
## 
##    0  0.5 0.75    1 1.25  1.5 1.75    2 2.25  2.5 2.75    3 3.25  3.5 3.75    4 
##    9    3   57 3042    6 1140 2416 1546 1614 4282  943  587  470  587  129  111 
## 4.25  4.5 4.75    5 5.25  5.5 5.75    6 6.25  6.5 6.75  7.5 7.75    8 
##   62   76   21   16   11    7    2    6    2    1    1    1    1    2

Based on the table above we have decided to joing the more than 4 in the same category and less than 1 in other category.

Below is presented the boxplot of price by bathrooms

db_train_linear <- db_train_linear %>%
  mutate(
    bathrooms = as.numeric(bathrooms),
    
    bathrooms_cat = as.factor(case_when(
      bathrooms >= 4 ~ "more than 4",
      bathrooms < 1 ~ "0 less than 1",
      TRUE ~ as.character(bathrooms))
  ))

# bar plot
db_train_linear %>%
  ggplot(aes(x = bathrooms_cat, y = price, fill = bathrooms_cat)) +
  geom_boxplot(color = "black") +
  ggtitle("Price by Number of bathrooms") +
  ylab("Price") +
  xlab("Number of bathrooms") +
  labs(fill = "Number of bathrooms") +
  CSGo::scale_fill_csgo()+
  CSGo::theme_csgo()

The boxplot shows that the bigger the number of bathrooms bigger the price.

floors

First let’s take a look in the frequency table bellow.

# freq table
table(db_train$floors)
## 
##    1  1.5    2  2.5    3  3.5 
## 8448 1509 6569  130  490    5

Based on the table above we have decided to joing the more than 2 in the same category.

Below is presented the boxplot of price by floors

db_train_linear <- db_train_linear %>%
  mutate(
   floors_cat = as.factor(case_when(
      floors >= 2 ~ "more than 2",
      TRUE ~ as.character(floors))
  ))

# bar plot
db_train_linear %>%
  ggplot(aes(x = floors_cat, y = price, fill = floors_cat)) +
  geom_boxplot(color = "black") +
  ggtitle("Price by Number of floors") +
  ylab("Price") +
  xlab("Number of floors") +
  labs(fill = "Number of floors") +
  CSGo::scale_fill_csgo()+
  CSGo::theme_csgo()

The boxplot shows that the number of floors does not influential in the price.

waterfront

First let’s take a look in the frequency table bellow.

# freq table
table(db_train$waterfront)
## 
##     0     1 
## 17018   133

Based on the table above can see that we have very unbalanced classes for the waterfront variable. The most of the data is FALSE. However the boxplot bellow suggests that when the waterfront is TRUE the price of the houses are bigger.

Below is presented the boxplot of price by waterfront.

# bar plot
db_train %>%
  ggplot(aes(x = waterfront, y = price, fill = waterfront)) +
  geom_boxplot(color = "black") +
  ggtitle("Price by waterfront") +
  ylab("Price") +
  xlab("waterfronts") +
  labs(fill = "waterfront") +
  CSGo::scale_fill_csgo()+
  CSGo::theme_csgo()

The boxplot shows when the waterfront is TRUE the higher will be the prices.

view

First let’s take a look in the frequency table bellow.

# freq table
table(db_train$view)
## 
##     0     1     2     3     4 
## 15450   272   770   407   252

Below is presented the boxplot of price by view.

db_train_linear <- db_train_linear %>%
  mutate(
   view = as.factor(view)
  )

# bar plot
db_train_linear %>%
  ggplot(aes(x = view, y = price, fill = view)) +
  geom_boxplot(color = "black") +
  ggtitle("Price by view") +
  ylab("Price") +
  xlab("view") +
  labs(fill = "view") +
  CSGo::scale_fill_csgo()+
  CSGo::theme_csgo()

The boxplot shows that bigger the view bigger the price.

condition

First let’s take a look in the frequency table bellow.

# freq table
table(db_train$condition)
## 
##     1     2     3     4     5 
##    20   129 11120  4526  1356

Based on the table above we decided to joing the categories 1 and 2 into the same category.

db_train_linear <- db_train_linear %>%
  mutate(
   condition_cat = ifelse(as.numeric(condition) <= 2, "less than 2", as.factor(condition))
  )

# bar plot
db_train_linear %>%
  ggplot(aes(x = condition_cat, y = price, fill = condition_cat)) +
  geom_boxplot(color = "black") +
  ggtitle("Price by condition") +
  ylab("Price") +
  xlab("condition") +
  labs(fill = "condition") +
  CSGo::scale_fill_csgo()+
  CSGo::theme_csgo()

The boxplot shows that the condition has no influential in the price.

grade

First let’s take a look in the frequency table bellow.

# freq table
table(db_train$grade)
## 
##    3    4    5    6    7    8    9   10   11   12   13 
##    3   20  183 1577 7147 4832 2075  921  310   71   12

Based on the table above we decided to joing the categories 3, 4 and 5 into the same category and the 11, 12, 13 in other category.

db_train_linear <- db_train_linear %>%
  mutate(
    grade = as.numeric(grade),
    
    grade_cat = as.factor(case_when(
      grade <= 5 ~ "less than 5",
      grade >= 11 ~ "mores than 11",
      TRUE ~ as.character(grade))
  ))

# bar plot
db_train_linear %>%
  ggplot(aes(x = as.factor(grade_cat), y = price, fill = as.factor(grade_cat))) +
  geom_boxplot(color = "black") +
  ggtitle("Price by grade") +
  ylab("Price") +
  xlab("grade") +
  labs(fill = "grade") +
  CSGo::scale_fill_csgo()+
  CSGo::theme_csgo()

The boxplot shows that bigger the grades bigger the price.

Correlation

Now we are going to perform a correlation study to see if the are any continuous explanatory variables are related to each other and also the most correlated variables with the price.

require(corrplot)

db_train_contin <- db_train_linear %>%
  select(
    price,
    sqft_living,
    sqft_lot,
    sqft_above,
    sqft_basement
  )

cor_result <- cor(db_train_contin, method = "pearson")

corrplot::corrplot.mixed(cor_result)

It is possible to see that sqft_living and sqft_above presented a high correlation (0.88), so to avoid multicolinearty problems we will not consider the sqft_above in our model, this choice is also based on that the sqft_living has the higher correlation to our target variable (price) compared to sqft_above.

Transformation

pairs(data %>% select_if(is.numeric) %>% select(-sqft_above)) 

Modelling

Linear Model

In this session we will train a linear regression to predict our target variable price. We will also run a step wise selection.

Let’s first define our start formula. We will not consider the following variables: sqft_above, condition, and floors.

# formula
form <- formula(price ~ sqft_living + sqft_basement + sqft_lot + grade_cat + view + waterfront + bathrooms_cat + bedrooms_cat)

Now let’s train the model using a linear regression method and select only the most significant predictors:

# model
#lmStepAIC
set.seed(1992)

mod <- train(
  form,
  data = db_train_linear, 
  method = "lmStepAIC",
  trace=FALSE,
  na.action = na.omit)

summary(mod)
## 
## Call:
## lm(formula = .outcome ~ sqft_living + sqft_basement + sqft_lot + 
##     grade_cat6 + grade_cat7 + grade_cat8 + grade_cat9 + `grade_catless than 5` + 
##     `grade_catmores than 11` + view1 + view2 + view3 + view4 + 
##     waterfront1 + bathrooms_cat1 + bathrooms_cat1.5 + bathrooms_cat1.75 + 
##     bathrooms_cat2 + bathrooms_cat2.5 + bathrooms_cat2.75 + bathrooms_cat3.25 + 
##     bathrooms_cat3.75 + `bathrooms_catmore than 4` + bedrooms_cat2 + 
##     bedrooms_cat3 + bedrooms_cat4 + bedrooms_cat8 + `bedrooms_catmore than 8`, 
##     data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1570917  -115050   -18995    91454  4535287 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.253e+05  1.579e+04  26.930  < 2e-16 ***
## sqft_living                 1.618e+02  4.090e+00  39.558  < 2e-16 ***
## sqft_basement               4.350e+01  4.731e+00   9.194  < 2e-16 ***
## sqft_lot                   -3.642e-01  4.546e-02  -8.011 1.21e-15 ***
## grade_cat6                 -4.141e+05  1.202e+04 -34.455  < 2e-16 ***
## grade_cat7                 -3.504e+05  9.969e+03 -35.149  < 2e-16 ***
## grade_cat8                 -2.678e+05  9.253e+03 -28.942  < 2e-16 ***
## grade_cat9                 -1.448e+05  9.122e+03 -15.879  < 2e-16 ***
## `grade_catless than 5`     -4.389e+05  1.924e+04 -22.808  < 2e-16 ***
## `grade_catmores than 11`    3.156e+05  1.417e+04  22.266  < 2e-16 ***
## view1                       1.350e+05  1.369e+04   9.857  < 2e-16 ***
## view2                       7.958e+04  8.376e+03   9.500  < 2e-16 ***
## view3                       1.303e+05  1.152e+04  11.307  < 2e-16 ***
## view4                       3.191e+05  1.819e+04  17.539  < 2e-16 ***
## waterfront1                 4.639e+05  2.444e+04  18.981  < 2e-16 ***
## bathrooms_cat1              5.906e+04  7.322e+03   8.066 7.72e-16 ***
## bathrooms_cat1.5            3.068e+04  8.273e+03   3.709 0.000209 ***
## bathrooms_cat1.75           2.143e+04  6.514e+03   3.290 0.001004 ** 
## bathrooms_cat2              2.874e+04  7.322e+03   3.925 8.69e-05 ***
## bathrooms_cat2.5           -4.953e+04  5.467e+03  -9.060  < 2e-16 ***
## bathrooms_cat2.75          -2.269e+04  8.389e+03  -2.704 0.006849 ** 
## bathrooms_cat3.25           5.470e+04  1.125e+04   4.862 1.17e-06 ***
## bathrooms_cat3.75           9.913e+04  2.031e+04   4.880 1.07e-06 ***
## `bathrooms_catmore than 4`  2.025e+05  1.477e+04  13.710  < 2e-16 ***
## bedrooms_cat2               6.884e+04  1.919e+04   3.587 0.000335 ***
## bedrooms_cat3               6.997e+04  6.770e+03  10.335  < 2e-16 ***
## bedrooms_cat4               2.252e+04  4.240e+03   5.311 1.10e-07 ***
## bedrooms_cat8              -6.858e+04  3.921e+04  -1.749 0.080349 .  
## `bedrooms_catmore than 8`   1.007e+05  5.133e+04   1.962 0.049817 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221700 on 17122 degrees of freedom
## Multiple R-squared:  0.6373, Adjusted R-squared:  0.6367 
## F-statistic:  1075 on 28 and 17122 DF,  p-value: < 2.2e-16

So, the linear model will be:

mod$finalModel
## 
## Call:
## lm(formula = .outcome ~ sqft_living + sqft_basement + sqft_lot + 
##     grade_cat6 + grade_cat7 + grade_cat8 + grade_cat9 + `grade_catless than 5` + 
##     `grade_catmores than 11` + view1 + view2 + view3 + view4 + 
##     waterfront1 + bathrooms_cat1 + bathrooms_cat1.5 + bathrooms_cat1.75 + 
##     bathrooms_cat2 + bathrooms_cat2.5 + bathrooms_cat2.75 + bathrooms_cat3.25 + 
##     bathrooms_cat3.75 + `bathrooms_catmore than 4` + bedrooms_cat2 + 
##     bedrooms_cat3 + bedrooms_cat4 + bedrooms_cat8 + `bedrooms_catmore than 8`, 
##     data = dat)
## 
## Coefficients:
##                (Intercept)                 sqft_living  
##                  4.253e+05                   1.618e+02  
##              sqft_basement                    sqft_lot  
##                  4.350e+01                  -3.642e-01  
##                 grade_cat6                  grade_cat7  
##                 -4.141e+05                  -3.504e+05  
##                 grade_cat8                  grade_cat9  
##                 -2.678e+05                  -1.448e+05  
##     `grade_catless than 5`    `grade_catmores than 11`  
##                 -4.389e+05                   3.156e+05  
##                      view1                       view2  
##                  1.350e+05                   7.958e+04  
##                      view3                       view4  
##                  1.303e+05                   3.191e+05  
##                waterfront1              bathrooms_cat1  
##                  4.639e+05                   5.906e+04  
##           bathrooms_cat1.5           bathrooms_cat1.75  
##                  3.068e+04                   2.143e+04  
##             bathrooms_cat2            bathrooms_cat2.5  
##                  2.874e+04                  -4.953e+04  
##          bathrooms_cat2.75           bathrooms_cat3.25  
##                 -2.269e+04                   5.470e+04  
##          bathrooms_cat3.75  `bathrooms_catmore than 4`  
##                  9.913e+04                   2.025e+05  
##              bedrooms_cat2               bedrooms_cat3  
##                  6.884e+04                   6.997e+04  
##              bedrooms_cat4               bedrooms_cat8  
##                  2.252e+04                  -6.858e+04  
##  `bedrooms_catmore than 8`  
##                  1.007e+05

And below is presented the results of this model:

mod$results
##   parameter     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      none 223205.1 0.6310368 146329.2 4937.735 0.01012315 1407.002

Model Evaluation

In this session we are going to apply the model above in our test data set to see how it performs in a new data set.

First let’s transform some categories we created before in our test data set, and then apply our model to it.

db_test_linear <- db_test_raw %>%
  mutate(
    bedrooms_cat = ifelse(bedrooms >=8, "more than 8", as.factor(bedrooms)),
    
    bathrooms = as.numeric(bathrooms),
    
    view = as.factor(view),
    
    waterfront = as.factor(as.numeric(as.factor(waterfront)) - 1),
    
    bathrooms_cat = as.factor(case_when(
      bathrooms >= 4 ~ "more than 4",
      bathrooms < 1 ~ "0 less than 1",
      TRUE ~ as.character(bathrooms))
    ),
    
    grade = as.numeric(grade),
    
    grade_cat = as.factor(case_when(
      grade <= 5 ~ "less than 5",
      grade >= 11 ~ "mores than 11",
      TRUE ~ as.character(grade))
    )
  )

db_test_linear <- db_test_linear %>%
  mutate(
    pred = predict(mod, db_test_linear)
  )

Now let’s plot the predicted vs observed.

db_test_linear %>% ggplot(aes(x = price, y = pred)) +
  geom_point() +
  CSGo::theme_csgo()

As we can see above, our lineal model tends to predict the prize of the property larger than the actual price. Ideally, the points should be around the diagonal and in this case most of them are above it.

Conclusion

The model developed to predict the price was considered satisfactory. However the use of generalized additive models could improve the results, as there are some predictors that they don’t have a linear relationship with the response.

Generalized Additive Models

As we saw in the previous step, we realized that the relationship between the predictors and the response variable, in many cases, is not linear. Therefore, we are going to using the Generalized Additive Models to see if this type of regression fits the data properly.

Densities

Let’s plot the densities function for each predictor at the very beginning:

db_train_raw %>%  
  dplyr::select(-id,-date) %>% 
  melt() %>% 
  ggplot(
    aes(
    x = value
    , col = variable
    )
) +
  geom_density() +
  facet_wrap(~variable, scales = "free")

There are some cases where some transformation are recommended, let’s plot the density functions after this:

db_train = db_train %>% 
  mutate(
    sqft_living = log(sqft_living)
    , sqft_lot = log(sqft_lot)
    , sqft_above = log(sqft_above)
    , sqft_living15 = log(sqft_living15)
    , sqft_lot15 = log(sqft_lot15)
    , has_basement = as.factor(ifelse(db_train$sqft_basement > 0, 1, 0))
  )
db_train %>% summary()
##      price            bedrooms        bathrooms      sqft_living   
##  Min.   :  75000   Min.   : 0.000   Min.   :0.000   Min.   :5.914  
##  1st Qu.: 324844   1st Qu.: 3.000   1st Qu.:1.750   1st Qu.:7.265  
##  Median : 450000   Median : 3.000   Median :2.250   Median :7.557  
##  Mean   : 541963   Mean   : 3.373   Mean   :2.117   Mean   :7.552  
##  3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.:7.844  
##  Max.   :7700000   Max.   :33.000   Max.   :8.000   Max.   :9.513  
##     sqft_lot          floors      waterfront      view          condition    
##  Min.   : 6.254   Min.   :1.000   0:17018    Min.   :0.0000   Min.   :1.000  
##  1st Qu.: 8.524   1st Qu.:1.000   1:  133    1st Qu.:0.0000   1st Qu.:3.000  
##  Median : 8.939   Median :1.500              Median :0.0000   Median :3.000  
##  Mean   : 8.987   Mean   :1.496              Mean   :0.2356   Mean   :3.412  
##  3rd Qu.: 9.273   3rd Qu.:2.000              3rd Qu.:0.0000   3rd Qu.:4.000  
##  Max.   :13.968   Max.   :3.500              Max.   :4.0000   Max.   :5.000  
##      grade          sqft_above    sqft_basement       yr_built   
##  Min.   : 3.000   Min.   :5.914   Min.   :   0.0   Min.   :1900  
##  1st Qu.: 7.000   1st Qu.:7.090   1st Qu.:   0.0   1st Qu.:1952  
##  Median : 7.000   Median :7.352   Median :   0.0   Median :1975  
##  Mean   : 7.665   Mean   :7.397   Mean   : 292.4   Mean   :1971  
##  3rd Qu.: 8.000   3rd Qu.:7.701   3rd Qu.: 570.0   3rd Qu.:1997  
##  Max.   :13.000   Max.   :9.150   Max.   :4820.0   Max.   :2015  
##   yr_renovated        zipcode           lat             long       
##  Min.   :   0.00   Min.   :98001   Min.   :47.16   Min.   :-122.5  
##  1st Qu.:   0.00   1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3  
##  Median :   0.00   Median :98065   Median :47.57   Median :-122.2  
##  Mean   :  85.18   Mean   :98078   Mean   :47.56   Mean   :-122.2  
##  3rd Qu.:   0.00   3rd Qu.:98117   3rd Qu.:47.68   3rd Qu.:-122.1  
##  Max.   :2015.00   Max.   :98199   Max.   :47.78   Max.   :-121.3  
##  sqft_living15     sqft_lot15     has_basement
##  Min.   :6.131   Min.   : 6.479   0:10423     
##  1st Qu.:7.307   1st Qu.: 8.537   1: 6728     
##  Median :7.518   Median : 8.939               
##  Mean   :7.541   Mean   : 8.958               
##  3rd Qu.:7.766   3rd Qu.: 9.219               
##  Max.   :8.734   Max.   :13.237
db_train %>% 
  melt() %>% 
  ggplot(
    aes(
    x = value
    , col = variable
    )
) +
  geom_density() +
  facet_wrap(~variable, scales = "free")
## Using waterfront, has_basement as id variables

Dealing with Collinearity

As we can see in the next correlation plot, there are some variables that are high correlated. Thus, in order to prevent collinearity, we are going to remove some variables:

corrplot(
  cor(db_train %>% select_if(is.numeric))
  , order = "hclust"
)

#Remove highly correlated variables to prevent collinearity 
corrplot(
  cor(db_train %>% dplyr::select(-grade, -sqft_above, -sqft_lot, -sqft_living) %>% select_if(is.numeric))
  , order = "hclust"
)

db_train = db_train %>% dplyr::select(-grade, -sqft_above, -sqft_lot, -sqft_living)

Relationship between response and covariates

In the next plots, we are going to see relationship between the response and the covariates.

db_train %>% 
  select_if(is.numeric) %>%
  melt(id = "price") %>% 
  ggplot(
    aes(y = price, x = value)
  ) + 
  geom_point() + 
  geom_smooth() +
  facet_wrap(~variable, scale = "free")

As we can see in the plots above, there is no linear relationship between price and the covariates.

Model Building (Step by step)

Let’s train a model which provides a good behavior under the conditions mentioned above.

library(AICcmodavg)
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.
library(survival)
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
library(visreg)
#Highest Correlated Variable
model.gam.liv = mgcv::gam(log(price) ~ s(sqft_living15), data = db_train)
summary(model.gam.liv)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.052216   0.003131    4169   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df    F p-value    
## s(sqft_living15) 7.865  8.635 1263  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.389   Deviance explained = 38.9%
## GCV = 0.16817  Scale est. = 0.16809   n = 17151
anova(model.gam.liv)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15)
## 
## Approximate significance of smooth terms:
##                    edf Ref.df    F p-value
## s(sqft_living15) 7.865  8.635 1263  <2e-16
visreg(model.gam.liv, "sqft_living15")

model.gam.liv.bat = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms)
  , data = db_train)

summary(model.gam.liv.bat)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.052216   0.002953    4420   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df     F p-value    
## s(sqft_living15) 6.922  7.960 575.9  <2e-16 ***
## s(bathrooms)     7.140  7.874 269.7  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.456   Deviance explained = 45.6%
## GCV = 0.14972  Scale est. = 0.14959   n = 17151
anova(model.gam.liv, model.gam.liv.bat)
## Analysis of Deviance Table
## 
## Model 1: log(price) ~ s(sqft_living15)
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms)
##   Resid. Df Resid. Dev     Df Deviance
## 1     17141     2881.3                
## 2     17134     2563.3 7.1997   318.07
visreg(model.gam.liv.bat, "sqft_living15")

visreg(model.gam.liv.bat, "bathrooms")

Above we can see the estimated smooth effect s of sqft_living and bathrooms.

model.gam.liv.bat.view = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms) + 
    view
  , data = db_train)

summary(model.gam.liv.bat.view)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.024686   0.003028 4301.98   <2e-16 ***
## view         0.116848   0.003952   29.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df     F p-value    
## s(sqft_living15) 7.215  8.193 462.8  <2e-16 ***
## s(bathrooms)     7.029  7.786 254.8  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.482   Deviance explained = 48.3%
## GCV = 0.14248  Scale est. = 0.14234   n = 17151
visreg(model.gam.liv.bat.view, "sqft_living15")

visreg(model.gam.liv.bat.view, "bathrooms")

anova(model.gam.liv.bat, model.gam.liv.bat.view)
## Analysis of Deviance Table
## 
## Model 1: log(price) ~ s(sqft_living15) + s(bathrooms)
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms) + view
##   Resid. Df Resid. Dev     Df Deviance
## 1     17134     2563.3                
## 2     17133     2439.0 1.1449   124.28
rbind(
  c("~living15", AIC(model.gam.liv)) 
  , c("~living15 + bathrooms", AIC(model.gam.liv.bat))
  , c("~living15 + bathrooms + view", AIC(model.gam.liv.bat.view))
)
##      [,1]                           [,2]              
## [1,] "~living15"                    "18098.268228551" 
## [2,] "~living15 + bathrooms"        "16104.5097962339"
## [3,] "~living15 + bathrooms + view" "15254.4711858212"

In the previous model, we have added the variable view as a predictor in the model.

model.gam.liv.bat.view.cbase = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms) + 
    view + 
    s(sqft_basement)
  , data = db_train)

summary(model.gam.liv.bat.view.cbase)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.028720   0.003001  4342.0   <2e-16 ***
## view         0.099727   0.004020    24.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df      F p-value    
## s(sqft_living15) 7.029  8.049 482.81  <2e-16 ***
## s(bathrooms)     6.345  7.195 214.57  <2e-16 ***
## s(sqft_basement) 8.615  8.923  47.26  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.494   Deviance explained = 49.5%
## GCV = 0.13923  Scale est. = 0.13903   n = 17151
anova( model.gam.liv.bat.view, model.gam.liv.bat.view.cbase)
## Analysis of Deviance Table
## 
## Model 1: log(price) ~ s(sqft_living15) + s(bathrooms) + view
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement)
##   Resid. Df Resid. Dev     Df Deviance
## 1     17133     2439.0                
## 2     17125     2381.2 8.1873   57.829
model.gam.liv.bat.view.fbase = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms) + 
    view + 
    has_basement
  , data = db_train)

summary(model.gam.liv.bat.view.fbase)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + has_basement
## 
## Parametric coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.979134   0.003747 3464.16   <2e-16 ***
## view           0.102618   0.003969   25.85   <2e-16 ***
## has_basement1  0.124667   0.006171   20.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df     F p-value    
## s(sqft_living15) 7.03  8.049 506.1  <2e-16 ***
## s(bathrooms)     6.71  7.527 219.1  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.494   Deviance explained = 49.5%
## GCV = 0.13921  Scale est. = 0.13907   n = 17151
anova(model.gam.liv.bat.view.cbase, model.gam.liv.bat.view.fbase)
## Analysis of Deviance Table
## 
## Model 1: log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement)
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms) + view + has_basement
##   Resid. Df Resid. Dev      Df Deviance
## 1     17125     2381.2                 
## 2     17132     2382.9 -7.5904  -1.7326
rbind(
  c("~living15", AIC(model.gam.liv)) 
  , c("~living15 + bathrooms", AIC(model.gam.liv.bat))
  , c("~living15 + bathrooms + view", AIC(model.gam.liv.bat.view))
  , c("~living15 + bathrooms + view + sqft_basement", AIC(model.gam.liv.bat.view.cbase))
  , c("~living15 + bathrooms + view + has_basement", AIC(model.gam.liv.bat.view.fbase))
)
##      [,1]                                           [,2]              
## [1,] "~living15"                                    "18098.268228551" 
## [2,] "~living15 + bathrooms"                        "16104.5097962339"
## [3,] "~living15 + bathrooms + view"                 "15254.4711858212"
## [4,] "~living15 + bathrooms + view + sqft_basement" "14858.4149482997"
## [5,] "~living15 + bathrooms + view + has_basement"  "14856.3927235136"
model.gam.liv.bat.view.cbase.wtf = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms) + 
    view + 
    s(sqft_basement) + 
    waterfront
  , data = db_train)

summary(model.gam.liv.bat.view.cbase.wtf)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.030081   0.002993 4353.34   <2e-16 ***
## view         0.081237   0.004355   18.65   <2e-16 ***
## waterfront1  0.386204   0.035660   10.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df      F p-value    
## s(sqft_living15) 6.994  8.020 492.18  <2e-16 ***
## s(bathrooms)     6.191  7.061 219.98  <2e-16 ***
## s(sqft_basement) 8.668  8.941  49.32  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.498   Deviance explained = 49.8%
## GCV = 0.1383  Scale est. = 0.13809   n = 17151
anova(model.gam.liv.bat.view.cbase, model.gam.liv.bat.view.cbase.wtf)
## Analysis of Deviance Table
## 
## Model 1: log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement)
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront
##   Resid. Df Resid. Dev      Df Deviance
## 1     17125     2381.2                 
## 2     17124     2365.0 0.85592   16.143
AIC(model.gam.liv.bat.view.cbase)
## [1] 14858.41
AIC(model.gam.liv.bat.view.cbase.wtf)
## [1] 14743.47
model.gam.liv.bat.view.cbase.wtf2 = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms, by = waterfront) + 
    view + 
    s(sqft_basement) + 
    waterfront
  , data = db_train)

anova(model.gam.liv.bat.view.cbase.wtf,model.gam.liv.bat.view.cbase.wtf2)
## Analysis of Deviance Table
## 
## Model 1: log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + 
##     view + s(sqft_basement) + waterfront
##   Resid. Df Resid. Dev      Df Deviance
## 1     17124       2365                 
## 2     17123       2364 0.85106   1.0285
AIC(model.gam.liv.bat.view.cbase.wtf)
## [1] 14743.47
AIC(model.gam.liv.bat.view.cbase.wtf2)
## [1] 14737.69
model.gam.liv.bat.view.cbase.wtf2.bed = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms, by = waterfront) + 
    view + 
    s(sqft_basement) + 
    waterfront  + 
    s(bedrooms)
  , data = db_train)

summary(model.gam.liv.bat.view.cbase.wtf2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + 
##     view + s(sqft_basement) + waterfront
## 
## Parametric coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 13.030022   0.002993 4353.848   <2e-16 ***
## view         0.081420   0.004356   18.694   <2e-16 ***
## waterfront1  0.346587   0.038925    8.904   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                            edf Ref.df      F p-value    
## s(sqft_living15)         7.003  8.027 491.84  <2e-16 ***
## s(bathrooms):waterfront0 6.029  6.906 219.32  <2e-16 ***
## s(bathrooms):waterfront1 1.000  1.000  83.16  <2e-16 ***
## s(sqft_basement)         8.660  8.940  49.10  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.498   Deviance explained = 49.9%
## GCV = 0.13825  Scale est. = 0.13804   n = 17151
anova(model.gam.liv.bat.view.cbase.wtf2.bed)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + 
##     view + s(sqft_basement) + waterfront + s(bedrooms)
## 
## Parametric Terms:
##            df      F p-value
## view        1 356.50  <2e-16
## waterfront  1  78.48  <2e-16
## 
## Approximate significance of smooth terms:
##                            edf Ref.df       F p-value
## s(sqft_living15)         6.914  7.955 467.174  <2e-16
## s(bathrooms):waterfront0 6.823  7.588 171.551  <2e-16
## s(bathrooms):waterfront1 1.000  1.000  84.445  <2e-16
## s(sqft_basement)         8.637  8.932  45.153  <2e-16
## s(bedrooms)              7.963  8.482   8.387  <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf2)
## [1] 14737.69
AIC(model.gam.liv.bat.view.cbase.wtf2.bed)
## [1] 14679.05
model.gam.liv.bat.view.cbase.wtf2.bed.flo = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms, by = waterfront) + 
    view + 
    s(sqft_basement, by = waterfront) + 
    waterfront  + 
    s(bedrooms) + 
    floors
  , data = db_train)

summary(model.gam.liv.bat.view.cbase.wtf2.bed.flo)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + 
##     view + s(sqft_basement, by = waterfront) + waterfront + s(bedrooms) + 
##     floors
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.771877   0.010737 1189.49   <2e-16 ***
## view         0.078214   0.004293   18.22   <2e-16 ***
## waterfront1  0.354416   0.039911    8.88   <2e-16 ***
## floors       0.172952   0.006928   24.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                edf Ref.df       F p-value    
## s(sqft_living15)             6.894  7.938 488.141 < 2e-16 ***
## s(bathrooms):waterfront0     7.079  7.783  75.885 < 2e-16 ***
## s(bathrooms):waterfront1     1.000  1.000  44.399 < 2e-16 ***
## s(sqft_basement):waterfront0 7.840  8.386  98.808 < 2e-16 ***
## s(sqft_basement):waterfront1 1.206  1.382   7.576 0.00197 ** 
## s(bedrooms)                  8.101  8.587   8.428 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.517   Deviance explained = 51.8%
## GCV = 0.13296  Scale est. = 0.13268   n = 17151
anova(model.gam.liv.bat.view.cbase.wtf2.bed.flo)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + 
##     view + s(sqft_basement, by = waterfront) + waterfront + s(bedrooms) + 
##     floors
## 
## Parametric Terms:
##            df      F p-value
## view        1 331.90  <2e-16
## waterfront  1  78.86  <2e-16
## floors      1 623.14  <2e-16
## 
## Approximate significance of smooth terms:
##                                edf Ref.df       F p-value
## s(sqft_living15)             6.894  7.938 488.141 < 2e-16
## s(bathrooms):waterfront0     7.079  7.783  75.885 < 2e-16
## s(bathrooms):waterfront1     1.000  1.000  44.399 < 2e-16
## s(sqft_basement):waterfront0 7.840  8.386  98.808 < 2e-16
## s(sqft_basement):waterfront1 1.206  1.382   7.576 0.00197
## s(bedrooms)                  8.101  8.587   8.428 < 2e-16
AIC(model.gam.liv.bat.view.cbase.wtf2.bed)
## [1] 14679.05
AIC(model.gam.liv.bat.view.cbase.wtf2.bed.flo)
## [1] 14068.3
model.gam.liv.bat.view.cbase.wtf2.bed.flo.loc = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms, by = waterfront) + 
    view + 
    s(sqft_basement, by = waterfront) + 
    waterfront  + 
    s(bedrooms) + 
    floors + 
    s(lat) +
    s(long)
  , data = db_train)

summary(model.gam.liv.bat.view.cbase.wtf2.bed.flo.loc)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + 
##     view + s(sqft_basement, by = waterfront) + waterfront + s(bedrooms) + 
##     floors + s(lat) + s(long)
## 
## Parametric coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12.981817   0.007557 1717.812  < 2e-16 ***
## view         0.070976   0.002969   23.905  < 2e-16 ***
## waterfront1  0.486422   0.029114   16.708  < 2e-16 ***
## floors       0.033176   0.004878    6.801 1.07e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                edf Ref.df        F p-value    
## s(sqft_living15)             4.409  5.485 1048.602  <2e-16 ***
## s(bathrooms):waterfront0     6.713  7.491  185.949  <2e-16 ***
## s(bathrooms):waterfront1     5.970  6.766   14.376  <2e-16 ***
## s(sqft_basement):waterfront0 8.906  8.993   18.196  <2e-16 ***
## s(sqft_basement):waterfront1 1.000  1.000    3.123  0.0772 .  
## s(bedrooms)                  5.582  6.436   49.819  <2e-16 ***
## s(lat)                       8.942  8.999 2159.931  <2e-16 ***
## s(long)                      8.872  8.991   82.008  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.78   Deviance explained = 78.1%
## GCV = 0.06064  Scale est. = 0.060448  n = 17151
anova(model.gam.liv.bat.view.cbase.wtf2.bed.flo.loc)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + 
##     view + s(sqft_basement, by = waterfront) + waterfront + s(bedrooms) + 
##     floors + s(lat) + s(long)
## 
## Parametric Terms:
##            df      F  p-value
## view        1 571.45  < 2e-16
## waterfront  1 279.15  < 2e-16
## floors      1  46.26 1.07e-11
## 
## Approximate significance of smooth terms:
##                                edf Ref.df        F p-value
## s(sqft_living15)             4.409  5.485 1048.602  <2e-16
## s(bathrooms):waterfront0     6.713  7.491  185.949  <2e-16
## s(bathrooms):waterfront1     5.970  6.766   14.376  <2e-16
## s(sqft_basement):waterfront0 8.906  8.993   18.196  <2e-16
## s(sqft_basement):waterfront1 1.000  1.000    3.123  0.0772
## s(bedrooms)                  5.582  6.436   49.819  <2e-16
## s(lat)                       8.942  8.999 2159.931  <2e-16
## s(long)                      8.872  8.991   82.008  <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf2.bed.flo)
## [1] 14068.3
AIC(model.gam.liv.bat.view.cbase.wtf2.bed.flo.loc)
## [1] 603.4191

Adding Lat and Long decreased the AIC value from 14068 to 603 (96% less) and explained 27% of the deviance. And the interaction with waterfront doesnt seem to be relevant anymore.

model.gam.liv.bat.view.cbase.wtf.bed.flo.loc = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms) + 
    view + 
    s(sqft_basement) + 
    waterfront  + 
    s(bedrooms) + 
    floors + 
    s(lat) +
    s(long)
  , data = db_train)

summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront + s(bedrooms) + floors + s(lat) + s(long)
## 
## Parametric coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12.982527   0.007554 1718.554  < 2e-16 ***
## view         0.070884   0.002967   23.893  < 2e-16 ***
## waterfront1  0.495719   0.023912   20.731  < 2e-16 ***
## floors       0.032845   0.004877    6.735 1.69e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df       F p-value    
## s(sqft_living15) 4.362  5.432 1066.93  <2e-16 ***
## s(bathrooms)     7.249  7.950  179.68  <2e-16 ***
## s(sqft_basement) 4.512  5.477   27.34  <2e-16 ***
## s(bedrooms)      5.224  6.087   52.57  <2e-16 ***
## s(lat)           8.937  8.999 2161.40  <2e-16 ***
## s(long)          8.878  8.991   81.47  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.78   Deviance explained =   78%
## GCV = 0.060691  Scale est. = 0.060539  n = 17151
anova(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront + s(bedrooms) + floors + s(lat) + s(long)
## 
## Parametric Terms:
##            df      F  p-value
## view        1 570.87  < 2e-16
## waterfront  1 429.78  < 2e-16
## floors      1  45.36 1.69e-11
## 
## Approximate significance of smooth terms:
##                    edf Ref.df       F p-value
## s(sqft_living15) 4.362  5.432 1066.93  <2e-16
## s(bathrooms)     7.249  7.950  179.68  <2e-16
## s(sqft_basement) 4.512  5.477   27.34  <2e-16
## s(bedrooms)      5.224  6.087   52.57  <2e-16
## s(lat)           8.937  8.999 2161.40  <2e-16
## s(long)          8.878  8.991   81.47  <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf2.bed.flo.loc)
## [1] 603.4191
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc)
## [1] 618.0408
model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms) + 
    view + 
    s(sqft_basement) + 
    waterfront  + 
    s(bedrooms) + 
    floors + 
    s(lat) +
    s(long) + 
    s(zipcode)
  , data = db_train)

summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode)
## 
## Parametric coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12.992193   0.007528 1725.832  < 2e-16 ***
## view         0.071146   0.002940   24.201  < 2e-16 ***
## waterfront1  0.518459   0.023621   21.949  < 2e-16 ***
## floors       0.026226   0.004865    5.391 7.11e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df       F p-value    
## s(sqft_living15) 5.043  6.174  999.67  <2e-16 ***
## s(bathrooms)     7.290  7.977  183.47  <2e-16 ***
## s(sqft_basement) 5.245  6.226   20.44  <2e-16 ***
## s(bedrooms)      5.455  6.315   56.34  <2e-16 ***
## s(lat)           8.837  8.991 1685.72  <2e-16 ***
## s(long)          8.564  8.906   64.19  <2e-16 ***
## s(zipcode)       8.943  8.998   59.53  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.786   Deviance explained = 78.7%
## GCV = 0.058921  Scale est. = 0.058738  n = 17151
anova(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode)
## 
## Parametric Terms:
##            df      F  p-value
## view        1 585.69  < 2e-16
## waterfront  1 481.77  < 2e-16
## floors      1  29.06 7.11e-08
## 
## Approximate significance of smooth terms:
##                    edf Ref.df       F p-value
## s(sqft_living15) 5.043  6.174  999.67  <2e-16
## s(bathrooms)     7.290  7.977  183.47  <2e-16
## s(sqft_basement) 5.245  6.226   20.44  <2e-16
## s(bedrooms)      5.455  6.315   56.34  <2e-16
## s(lat)           8.837  8.991 1685.72  <2e-16
## s(long)          8.564  8.906   64.19  <2e-16
## s(zipcode)       8.943  8.998   59.53  <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc)
## [1] 618.0408
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip)
## [1] 110.2283

Adding zipcode reduced the AIC score from 618 to 110 (85% less)

model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms) + 
    view + 
    s(sqft_basement) + 
    waterfront  + 
    s(bedrooms) + 
    floors + 
    s(lat) +
    s(long) + 
    s(zipcode) + 
    s(yr_built) + 
    s(yr_renovated)
  , data = db_train)

summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + 
##     s(yr_built) + s(yr_renovated)
## 
## Parametric coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12.979657   0.008480 1530.634  < 2e-16 ***
## view         0.068875   0.002920   23.589  < 2e-16 ***
## waterfront1  0.499333   0.023419   21.322  < 2e-16 ***
## floors       0.035060   0.005546    6.321 2.66e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df       F p-value    
## s(sqft_living15) 5.162  6.300  979.93  <2e-16 ***
## s(bathrooms)     7.649  8.255  181.04  <2e-16 ***
## s(sqft_basement) 4.333  5.288   24.95  <2e-16 ***
## s(bedrooms)      5.189  6.053   52.52  <2e-16 ***
## s(lat)           8.824  8.990 1581.43  <2e-16 ***
## s(long)          8.548  8.900   61.11  <2e-16 ***
## s(zipcode)       8.941  8.998   51.51  <2e-16 ***
## s(yr_built)      8.795  8.987   32.00  <2e-16 ***
## s(yr_renovated)  5.201  6.117   19.10  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.792   Deviance explained = 79.3%
## GCV = 0.057437  Scale est. = 0.057214  n = 17151
anova(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + 
##     s(yr_built) + s(yr_renovated)
## 
## Parametric Terms:
##            df      F  p-value
## view        1 556.43  < 2e-16
## waterfront  1 454.61  < 2e-16
## floors      1  39.96 2.66e-10
## 
## Approximate significance of smooth terms:
##                    edf Ref.df       F p-value
## s(sqft_living15) 5.162  6.300  979.93  <2e-16
## s(bathrooms)     7.649  8.255  181.04  <2e-16
## s(sqft_basement) 4.333  5.288   24.95  <2e-16
## s(bedrooms)      5.189  6.053   52.52  <2e-16
## s(lat)           8.824  8.990 1581.43  <2e-16
## s(long)          8.548  8.900   61.11  <2e-16
## s(zipcode)       8.941  8.998   51.51  <2e-16
## s(yr_built)      8.795  8.987   32.00  <2e-16
## s(yr_renovated)  5.201  6.117   19.10  <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip)
## [1] 110.2283
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr)
## [1] -327.4509
model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms) + 
    view + 
    s(sqft_basement) + 
    waterfront  + 
    s(bedrooms) + 
    floors + 
    s(lat) +
    s(long) + 
    s(zipcode) + 
    s(yr_built) + 
    s(yr_renovated) + 
    condition
  , data = db_train)

summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + 
##     s(yr_built) + s(yr_renovated) + condition
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.737756   0.013572 938.518  < 2e-16 ***
## view         0.068774   0.002877  23.904  < 2e-16 ***
## waterfront1  0.498074   0.023075  21.585  < 2e-16 ***
## floors       0.030017   0.005469   5.489 4.11e-08 ***
## condition    0.073115   0.003233  22.618  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df       F p-value    
## s(sqft_living15) 5.212  6.352 1029.79  <2e-16 ***
## s(bathrooms)     7.316  8.002  172.08  <2e-16 ***
## s(sqft_basement) 4.384  5.343   17.66  <2e-16 ***
## s(bedrooms)      5.061  5.924   52.79  <2e-16 ***
## s(lat)           8.831  8.991 1658.24  <2e-16 ***
## s(long)          8.598  8.919   58.69  <2e-16 ***
## s(zipcode)       8.945  8.998   50.31  <2e-16 ***
## s(yr_built)      8.602  8.953   23.32  <2e-16 ***
## s(yr_renovated)  4.881  5.764   40.25  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.798   Deviance explained = 79.9%
## GCV = 0.055775  Scale est. = 0.055557  n = 17151
anova(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + 
##     s(yr_built) + s(yr_renovated) + condition
## 
## Parametric Terms:
##            df      F  p-value
## view        1 571.40  < 2e-16
## waterfront  1 465.91  < 2e-16
## floors      1  30.13 4.11e-08
## condition   1 511.56  < 2e-16
## 
## Approximate significance of smooth terms:
##                    edf Ref.df       F p-value
## s(sqft_living15) 5.212  6.352 1029.79  <2e-16
## s(bathrooms)     7.316  8.002  172.08  <2e-16
## s(sqft_basement) 4.384  5.343   17.66  <2e-16
## s(bedrooms)      5.061  5.924   52.79  <2e-16
## s(lat)           8.831  8.991 1658.24  <2e-16
## s(long)          8.598  8.919   58.69  <2e-16
## s(zipcode)       8.945  8.998   50.31  <2e-16
## s(yr_built)      8.602  8.953   23.32  <2e-16
## s(yr_renovated)  4.881  5.764   40.25  <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr)
## [1] -327.4509
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond)
## [1] -831.1318
model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot = mgcv::gam(
  log(price) ~ s(sqft_living15) + 
    s(bathrooms) + 
    view + 
    s(sqft_basement) + 
    waterfront  + 
    s(bedrooms) + 
    floors + 
    s(lat) +
    s(long) + 
    s(zipcode) + 
    s(yr_built) + 
    s(yr_renovated) + 
    condition + 
    s(sqft_lot15)
  , data = db_train)

summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + 
##     s(yr_built) + s(yr_renovated) + condition + s(sqft_lot15)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.678685   0.013373  948.09   <2e-16 ***
## view         0.067461   0.002782   24.25   <2e-16 ***
## waterfront1  0.471993   0.022330   21.14   <2e-16 ***
## floors       0.069845   0.005551   12.58   <2e-16 ***
## condition    0.073112   0.003117   23.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df       F p-value    
## s(sqft_living15) 5.718  6.868  571.97  <2e-16 ***
## s(bathrooms)     6.932  7.709  159.04  <2e-16 ***
## s(sqft_basement) 5.319  6.303   22.83  <2e-16 ***
## s(bedrooms)      8.504  8.853   25.79  <2e-16 ***
## s(lat)           8.777  8.985 1854.81  <2e-16 ***
## s(long)          8.907  8.995   71.03  <2e-16 ***
## s(zipcode)       8.951  8.999   52.16  <2e-16 ***
## s(yr_built)      8.349  8.879   47.85  <2e-16 ***
## s(yr_renovated)  4.669  5.525   44.31  <2e-16 ***
## s(sqft_lot15)    8.339  8.865  148.07  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.812   Deviance explained = 81.3%
## GCV = 0.051849  Scale est. = 0.051609  n = 17151
anova(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + 
##     s(yr_built) + s(yr_renovated) + condition + s(sqft_lot15)
## 
## Parametric Terms:
##            df     F p-value
## view        1 588.0  <2e-16
## waterfront  1 446.8  <2e-16
## floors      1 158.3  <2e-16
## condition   1 550.0  <2e-16
## 
## Approximate significance of smooth terms:
##                    edf Ref.df       F p-value
## s(sqft_living15) 5.718  6.868  571.97  <2e-16
## s(bathrooms)     6.932  7.709  159.04  <2e-16
## s(sqft_basement) 5.319  6.303   22.83  <2e-16
## s(bedrooms)      8.504  8.853   25.79  <2e-16
## s(lat)           8.777  8.985 1854.81  <2e-16
## s(long)          8.907  8.995   71.03  <2e-16
## s(zipcode)       8.951  8.999   52.16  <2e-16
## s(yr_built)      8.349  8.879   47.85  <2e-16
## s(yr_renovated)  4.669  5.525   44.31  <2e-16
## s(sqft_lot15)    8.339  8.865  148.07  <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond)
## [1] -831.1318
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot)
## [1] -2083.016

AIC Summary

We are going to show the summary of the AIC procedure we have followed:

library(formula.tools)

paste(model.gam.liv$formula %>% as.character(), "|AIC:", AIC(model.gam.liv))
## [1] "log(price) ~ s(sqft_living15) |AIC: 18098.268228551"
paste(model.gam.liv.bat$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) |AIC: 16104.5097962339"
paste(model.gam.liv.bat.view$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view |AIC: 15254.4711858212"
paste(model.gam.liv.bat.view.cbase$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) |AIC: 14858.4149482997"
paste(model.gam.liv.bat.view.cbase.wtf$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront |AIC: 14743.4713138703"
paste(model.gam.liv.bat.view.cbase.wtf2$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf2))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + view + s(sqft_basement) + waterfront |AIC: 14737.6882662845"
paste(model.gam.liv.bat.view.cbase.wtf2.bed$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf2.bed))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + view + s(sqft_basement) + waterfront + s(bedrooms) |AIC: 14679.0533318368"
paste(model.gam.liv.bat.view.cbase.wtf2.bed.flo$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf2.bed.flo))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + view + s(sqft_basement, by = waterfront) + waterfront + s(bedrooms) + floors |AIC: 14068.3004361546"
paste(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront + s(bedrooms) + floors + s(lat) + s(long) |AIC: 618.040750878333"
paste(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) |AIC: 110.228331770422"
paste(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + s(yr_built) + s(yr_renovated) |AIC: -327.45089088778"
paste(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + s(yr_built) + s(yr_renovated) + condition |AIC: -831.131766486816"
paste(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + s(yr_built) + s(yr_renovated) + condition + s(sqft_lot15) |AIC: -2083.01626285448"
db_test = db_test_raw %>% 
  mutate(
    bathrooms = as.numeric(bathrooms)
    , floors = as.numeric(floors)
    , waterfront = as.factor(as.numeric(as.factor(waterfront)) - 1)
    , lat = as.numeric(lat)
    , long = as.numeric(long)
  ) %>% 
  mutate(
    lat = lat / (roundUp(abs(lat / 47))/10)
    , long = long / (roundUp(abs(long / 120))/10)
  ) %>% 
  dplyr::select(-id, -date) %>% 
  data.frame()

Accuracy over test set

db_test = db_test %>% 
  mutate(
    sqft_living = log(sqft_living)
    , sqft_lot = log(sqft_lot)
    , sqft_above = log(sqft_above)
    , sqft_living15 = log(sqft_living15)
    , sqft_lot15 = log(sqft_lot15)
  )

db_test$prediction = predict(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot, db_test)


db_test %>% ggplot(aes(x = log(price), y = prediction)) +
  geom_point() +
  CSGo::theme_csgo()

This model predicts the price of the property really well as we can see in the scatter plot showed above. The points generated from the actual value and the predicted are around the diagonal, being a really good sign.

This would be our best hedonic pricing model!

summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + 
##     waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + 
##     s(yr_built) + s(yr_renovated) + condition + s(sqft_lot15)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.678685   0.013373  948.09   <2e-16 ***
## view         0.067461   0.002782   24.25   <2e-16 ***
## waterfront1  0.471993   0.022330   21.14   <2e-16 ***
## floors       0.069845   0.005551   12.58   <2e-16 ***
## condition    0.073112   0.003117   23.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df       F p-value    
## s(sqft_living15) 5.718  6.868  571.97  <2e-16 ***
## s(bathrooms)     6.932  7.709  159.04  <2e-16 ***
## s(sqft_basement) 5.319  6.303   22.83  <2e-16 ***
## s(bedrooms)      8.504  8.853   25.79  <2e-16 ***
## s(lat)           8.777  8.985 1854.81  <2e-16 ***
## s(long)          8.907  8.995   71.03  <2e-16 ***
## s(zipcode)       8.951  8.999   52.16  <2e-16 ***
## s(yr_built)      8.349  8.879   47.85  <2e-16 ***
## s(yr_renovated)  4.669  5.525   44.31  <2e-16 ***
## s(sqft_lot15)    8.339  8.865  148.07  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.812   Deviance explained = 81.3%
## GCV = 0.051849  Scale est. = 0.051609  n = 17151